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Abstract 

This paper analytically studies the performance of a synchronous conservative par- 
allel discrete-event simulation protocol. The class of simulation models considered are 
oriented around a physical domain, and possess a limited ability to predict future be- 
havior. Using a stochastic model we show that as the volume of simulation activity 
in the model increases relative to a fixed architecture, the complexity of the average 
per-event overhead due to synchronization, event list manipulation, lookahead calcu- 
lations, and processor idle time approaches the complexity of the average per-event 
overhead of a serial simulation. The method is therefore within a constant factor of 
optimal. Our analysis demonstrates that on large problems — those for which parallel 
processing is ideally suited — there is often enough parallel workload so that processors 
are not usually idle. We also demonstrate the viability of the method empirically, 
showing how good performance is achieved on large problems using a thirty-two node 
Intel iPSC/2 distributed memory multiprocessor. 


*Supported in part by the Virginia Center for Innovative Technology, by NASA grants NAG-1-060 and 
NAS1-18605, and NSF grant ASC 8819393 
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1 Introduction 


The problem of parallelizing discrete-event simulations has received a great deal of attention 
in the last several years. Simulations pose unique synchronization constraints due to their 
underlying sense of time. When the simulation model can be simultaneously changed by 
different processors, actions by one processor can affect actions by another. One must not 
simulate any element of the model too far ahead of any other in simulation time, to avoid the 
risk of having its logical past affected. Alternately, one must be prepared to fix the logical 
past of any element determined to have been simulated too far. 

Two schools of thought have emerged concerning synchronization. The conservative 
school [5], [13], [23], [24] employs methods which prevent any processor from simulating 
beyond a point at which another processor might affect it. These synchronization points 
need to be re-established periodically to allow the simulation to progress. Early efforts 
focussed on finding protocols which were either free from deadlock, or which detected and 
corrected deadlock [17]. The optimistic school [7] allows a processor to simulate as far forward 
in time as it wants, without regard for the risk of having its simulation past affected. If its 
past is changed (due to interaction with a processor farther behind in simulation time) it 
must then be able to “rollback” in time at least that far, and must cancel any erroneous 
actions it has taken in its false future. 

Conservative protocols are sometimes faulted for leaving processors idle, due to overly 
pessimistic synchronization assumptions. It is almost always true that individual model 
elements are blocked because of pessimistic synchronization; the conclusion that processors 
tend to be blocked requires the assumption that all model elements assigned to a processor 
tend to be blocked simultaneously, or that each processor has only one model element. The 
latter assumption pervades many performance studies, and is unrealistic for fxned-grained 
simulation models executed on coarser grained multiprocessors. Intuition suggests that if 
there are many model elements assigned to each processor, then it is unlikely that all model 
elements on a processor will be blocked. Given sufficient workload, a properly designed 
conservative method should not leave processors idle, because there is so much work to do. 
While some model elements are blocked due to synchronization concerns, other elements, 
with high probability, are not. 

It is natural to ask how much performance degradation due to blocking a conservative 
method suffers. We answer that question, by analyzing a simple conservative synchronization 
method. The method assumes the ability to pre-sample activity duration times[20], and 
assumes that any queueing discipline used is non-preemptive. The protocol itself is quite 
simple. As applied to a queueing network it works as follows. First, whenever a job enters 
service, the queue to which the job will be routed is immediately notified of that arrival 
(sometime in the future), and the receiving queue computes a service time for the new arrival. 
These two actions constitute lookahead , a concept which is key to the protocol’s success. Now 
imagine that all events with time-stamps less than t have already been processed and that 
the processors are globally synchronized. For each queue we determine the time-stamp of the 


next job it would route (excluding one in service) if no further arrivals occur at that queue. 
The processors cooperatively compute the minimum such time, say <$(<). We will show that 
all further messages to be sent in the simulation have time-stamps at least as large as 8{t). 
Consequently a processor may evaluate, in parallel with all other processors, all of its events 
with time-stamps less than 8{t). Having done so, the processors synchronize globally, and 
repeat the process. The interval [f, £(<)) is called a window , and 8(t) — t is its width. 

We analyze the performance of the protocol by first deriving an approximated lower bound 
on the equilibrium mean window width. We then multiply this width by the equilibrium rate 
at which the simulation generates events. The resulting product is an approximated Tower 
bound on the the average number of events that are processed within a window. We then 
identify conditions under which the average number of events processed in a window increases 
without bound as the system simulation event generation rate increases. Next we analyze the 
synchronization, idle time, lookahead calculation, and event-list overheads of the protocol as 
a function of T, events in the system at a time. The average overhead per processed event is 
shown to be 0(/(T)), where f(T ) is the complexity of the average per-event overhead in a 
optimized serial simulation. Therefore the protocol’s asymptotic performance (as T — > oc) 
is within a constant factor of optimal. Finally, we demonstrate the viability of the protocol 
empirically. A parallel simulation system based on the protocol has been implemented on 
a thirty- two node Intel iPSC/2 distributed memory multiprocessor^]. Processor efficiencies 
in the range of 60% - 90% are reported for several different large simulation models. 

It is important to remember that our analysis concerns average case performance based 
on a general stochastic model. Specific problem examples can be constructed to ensure that 
the protocol essentially executes serially, while another can execute many things in parallel. 
We believe that such examples are somewhat artificial and do not shed a great deal of light 
on how performance will behave over a wide range of problems. Our intention is to study 
the average case performance on a model of typical simulation problems. 

This paper makes two basic contributions. One is to develop a new approach for the 
analysis of parallel discrete-event simulations. The second is a demonstration that many 
large simulation models having much concurrent activity can be effectively simulated in 
parallel using a simple conservative protocol. 

This paper is organized as follows. §2 gives some background for this work. §3 describes 
the model of discrete-event simulations we use in our protocol description and analysis, and 
then introduces the protocol. §4 derives an approximated lower bound on the average number 
of events processed in a window. §5 determines the complexity of the average total overhead 
per event suffered by using the protocol. §6 reports on the performance of the protocol on 
several different simulation models. §7 gives our conclusions. 
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2 Background 

Our protocol is similar to others recently proposed [1], [4], [13], [14], [28]. Unlike earlier 
asynchronous protocols, these synchronously move a window across simulation time, roughly 
as follows. Let floor be the lower edge of the window. This means that all events with 
time-stamps less than floor have already been processed. The processors then cooperatively 
determine the upper window edge, ceiling . This value is chosen in such a way that all events 
within [floor } ceiling) can safely be processed in parallel, ceiling becomes floor for the next 
window, and so on. 

The major question with synchronous conservative protocols is whether windows small 
enough to prevent dependencies between window events admit enough such events to keep 
all the processors busy. Lubachevsky was the first to answer this question [14], by deriving 
a lower bound on the number of events processed within a window defined by his method. 
Using this bound and some assumptions concerning event density (in simulation time), he 
shows that the performance of his method scales up as the problem size and number of 
processors are simultaneously increased. However, his results are not quantitative, although 
they might have been so developed. Our analysis is different, in that we define a model 
from which event densities follow naturally, and we quantify the average number of events 
processed in a window. Ours is an average case analysis, while Lubachevsky’s is a worst 
case analysis. Also, Lubachevsky’s analysis hinges on the assumption of a non-zero mini- 
mal propagation delay, while ours does not. We do show that minimum service times can 
dramatically improve the average number of events processed each window. 

The protocol we study is an application of the one described by Chandy and Sherman [4] 
to a more restricted problem domain. Like Lubachevsky’s method, they require periodic 
global synchronization among processors. Each window their protocol computes the min- 
imum time-stamp among all “conditional” events, and then processes all “unconditional” 
events with smaller time-stamps. In addition, their technique incorporates the conversion of 
“conditional” events into “unconditional” events, as a function of messages exchanged in the 
simulation. Such conversion is highly application dependent. The most important difference 
between our protocol and the general conditional-event approach lies in the specificity of 
our conversion of conditional events into unconditional events, in a way that requires lit- 
tle model-specific information. Furthermore, our protocol is stated within the context of a 
model closer to those used by simulation practitioners than is the model used to describe 
the conditional-event approach. 

Our analysis of lookahead is related to that developed by Lin and Lazowska in [10], and 
by Wagner and Lazowska in [30]. Their work analyzes the ability of different queue types 
to predict future behavior, and focuses on lookahead at a single queue. Our analysis is of a 
much simpler lookahead scheme, but analyzed over the entire simulation. The protocol we 
describe can be easily adapted to accommodate these more complex techniques for computing 
lookahead. We have also analyzed a different class of simulations than the one studied 
here, on massively parallel architectures[19]. The sensitivity of performance to lookahead is 
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quantified, upper bounds on optimal and optimistic performance are derived, as is a lower 
bound on the performance of the same protocol we study in this paper. 

Some analysis exists of the optimistic “Time Warp” method of synchronization. The ear- 
liest analyses concerned detailed stochastic models of two processor systems [9, 18]. These 
models include overhead costs and permit heterogeneous processors. Most other studies of 
Time Warp tend to assume negligible state-saving and rollback costs. For example, Lin and 
Lazowska have shown that if Time Warp has no state-saving or rollback costs, and if “cor- 
rect” computations are never rolled back, then Time Warp achieves optimality [11]. This is 
intuitive, because Time Warp aggressively searches for the simulation’s critical path — if it is 
able to do so without cost, its performance must be optimal. Other analyses highlight the 
fact that Time Warp can “guess right” while conservative methods must block. Lipton and 
Mizell have shown that there is a certain asymmetry between optimistic and conservative 
methods: while it is possible for an optimistic method to arbitrarily outperform a conserva- 
tive method, the converse is not true [12]. Their analysis explicitly includes overhead costs. 
Madisetti, Walrand, and Messerschmitt [16] have developed a performance model which es- 
timates the rate at which simulation time advances under an optimistic strategy such as 
Time Warp. They model the behavior of the system as a Markov chain, and include the 
cost of communication and of synchronization. Their analysis is exact for two processors, 
and approximate for a general number of processors. Lubachevsky, Schwartz, and Weiss use 
a sophisticated stochastic model to show how it is possible for Time Warp simulations to 
thrash in periods of “cascading rollbacks” [15], 


3 Model and Protocol 

We now describe our model of discrete-event simulations more formally, and define the 
synchronization protocol. 

3.1 Model Assumptions 

Consider a domain containing S sites , where activities occur. An activity (e.g., service 
given to a job at a queue) begins, ends, and upon its completion enables (i.e. causes) other 
activities. These causations are reported to the appropriate sites by way of completion 
messages. Consequently, three distinct events are associated with each activity: enable, 
begin, and complete. The enable event for a given activity can be different from the begin 
event if the site imposes queueing. We permit a completion to cause more than one activity 
in order to include simulation problems such as Petri-nets, where a single transition firing 
may cause token arrivals at multiple Petri-net places. Thus, we assume that a complete 
event at site Si causes an activity at each member of a random subset of other sites. All 
enable events caused by a completion have the same time-stamp as the completion. An 
activity is said to be occurring at time t if its associated begin event has a time-stamp no 
greater than i , and its complete event has time-stamp no less than if'. Fach site maintains 
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its own priority queue of events associated with activities enqueued or occurring at the site. 
Each site also maintains its own simulation clock, which records the time-stamp on the last 
event processed. 

Under the assumptions of our model the enable and complete events are unconditional— 
once placed on the event list, no further activity in the simulation will change them, begin 
events may be conditional. For example, a begin event at time t might describe the future 
placement of a particular job into service at a queue at time t. If before t another job with 
higher priority arrives, that begin event may be removed from the event list. 

Depending on the ability of the site, activities may occur there one at a time, or concur- 
rently. We assume that either an unbounded number of activities may simultaneously occur 
at a site, or that only one activity may occur at a time. In the former case, we say the site 
has infinite servers. In the latter case, enabled activities may be enqueued before occurring. 
The delay in simulation time between when an activity begins and ends is called its dura- 
tion. We assume that a duration is strictly positive, but do not assume a minimal duration. 
For the purposes of analysis we assume that the simulation model is ergodic, and that each 
duration time comes from a distribution composed by adding a nonnegative constant to an 
exponentially distributed random variable. Each site may have a unique distribution. 

Our performance analysis rests on a number of assumptions about the simulation model 
which are exploited by the protocol. 

1. We assume that once an activity begins, the causation of further activities cannot affect 
its completion time. 

2. We assume that the simulation state change due to an activity completion is very 
local — the state change is implied by knowledge of which activity completed, which 
activities are subsequently caused, and the time of the completion. 

3. We assume that the activities caused by the completion of activity Aj can be reported 
to their respective sites at the time that Aj begins. 

4. We assume that a lower bound on the duration of an activity can be determined at 
the time of the receipt of the completion message which causes the activity. 

To illustrate these assumptions, consider a job J which at time s begins service at a non- 
preemptive queue Q u completes at time s, and is routed to Q 2 . Assumption 1 is satisfied 
by the nature of <5i s queueing discipline. Assumption 2 is satisfied because the change in 
model state due to this departure is completely characterized by knowledge of Q i , Q 2 , and 
s. Assumption 3 is satisfied if the service discipline at Q x is non-preemptive and the routing 
is independent of the jobs enqueued at time s: in the simulation we can report the arrival 
of J at time s to Q 2 concurrently with the entering of J into service at Q\, at time s. By 
doing so, the processing required of J’s completion event at s does not include reporting 
J's departure, but may include the recording of statistics which depend on all simulation 
activity at Q\ (including arrivals) up to time s. Assumption 4 is satisfied if J s service time 
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at Q 2 can be computed at the time that Qi reports the arrival of J to Q 2 . This is possible 

if the service time of every job at Q 2 is drawn independently from the same probability 
distribution. 

This model describes a large number of common simulation models, and is related to 
event graphs described in [26] and [27]. Many queueing networks are obviously captured. 
Logic networks are described, with activities corresponding to logical module evaluations. 
Here new activities are caused when a module output changes state. The movement and 
interaction of objects in a domain can also be captured. One assumes no queueing at sites, 
and models the passage of an object across some discrete region of the domain as an activity. 
Lookahead plays a major role in our synchronization method and its analysis. Lookahead 
exists and is exploited by assumptions 3 and 4 above. 

Simulation workload is the event processing. This includes changing anticipated event 
times as a result of newly caused activities, in changing simulation state variables, and in 
gathering/recording statistics. We view event list management costs as inescapable overheads 
associated with the processing of events. 

Our protocol does not require a minimal duration time for its correctness. However, 
performance is substantially enhanced if every duration time is Founded from below by 
£>min > 0. Equivalently, we can introduce a minimal time delay between whenTan 

activity completes, and when activities it causes are enabled. We will use D^ n throughout 
our analysis, but may take it to be zero. ° 

3.2 Protocol Definition 

Next we define the synchronization protocol in terms of the model given in §3.1. Our only 
architectural assumptions are that the simulation model is executed on a multiprocessor 
having P processors; any processor can send a message (indirectly, if needed) to any other 
processor, and the processors can synchronize globally. 

One important aspect of our protocol is the “pre-sending” of completion messages. Let 
Aj be some activity whose begin event has time-stamp s. Let s be Ay’s completion time. 
Under our protocol Aj's site must send completion messages to all sites where activities 
caused by Aj ' s completion will occur, at the time Aj begins. Observe that even though the 
simulation time at A,’s site is s, these completion messages are time-stamped with time 
s > s. A site which receives such a notification inserts an enable event with time-stamp J 
into its event list (a non-queueing site may directly insert a begin event with time 3); it. also 
selects a duration time (or a lower bound on it) for the newly caused activity. 

Suppose the processors have globally synchronized, and let t be the minimum time-stamp 
among events at all sites. Each site 5, can determine a lower bound <5,(f) on the earliest 
completion time of any of its pending (i.e., as yet not begun) activities, assuming no further 
enable events are received. We call this the site’s lookahead bound. For example, consider a 
site Si with queueing. There are three cases to consider. 

Case 1: 5,- ’s event list is void of enable events. In this case we define S,(t) = oo. 
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Case 2: No activity is occurring at t, and Si ’s event list contains enable events. Let u be 

the earliest enable time among these, and define Si(t) to be the completion time of the 
activity enabled at u. 

Case 3: Some activity is occurring at t, and Si ’s event list contains enable events. Define 

$i(t) to be the completion time of the next enabled activity to receive service, assuming 
that no further enable events will be inserted into the event list. 


If Si has infinite servers, only two cases arise. If there are no begin events in Sfs event 
list, then define £,(<) = oo. If there are begin events in Si's event list, define £,■(£) to be the 
minimum completion time among these. 

Finally, define 


6(t) = min 

all sites S t 


{*■(<)}■ 


The protocol is very simple. Define w x = 0, and proceed as follows. 


1. Given w n , the processors cooperatively determine S(w n ). 

2. Each site may be simulated in parallel with all others until the time of the event with 
least time-stamp at that site is as large as 6(w n ). The processing of any begin event 
in this interval must include pre-sending the associated completion messages. 

3. Sites receive the messages sent during the processing of [w ra , S(w n )), select duration 
times for the associated caused activities, and insert events into their event lists. 

4. n = n + 1 . Goto step 1 . 

The obvious question to ask of this protocol is whether the sites can safely process all 
events within a window. The protocol is safe if, once the window is established, no further 
messages with time-stamps less than the upper edge of the window will ever be sent. The 
following theorem establishes this fact. 


Theorem 3.1 Let [u; n , 6(u; n )) be a window established by the protocol. Then every comple- 
tion message sent during the processing of [w n ,6(w n )) has a time-stamp at least as large as 
S(w n ). 


Proof: Completion messages are pre-sent by the processing of begin events. Let b 0 ,...,b k 
be the times of all begin events in [tn n , £(u> n )), in increasing order. We use induction to 
show that for i = 0, . . . , k , the completion messages associated with the begin event at 
time bi have time-stamps at least as large as 6(tc n ). For the base case consider bo, and let 
Si be the associated site. 5, computes Si(w n ) to be the minimum time-stamp on the next 
message it sends, provided no further messages are received at Si. By construction 5, will 
not receive any further messages with time-stamps less than b 0 , therefore the decision to 
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begin the activity at b 0 was correctly fore-seen during the computation of <5 4 -(u; n ), implying 
that the completion time of the activity beginning at 6 0 is no smaller than <5,(t/; n ), and hence 
is no smaller than S(w n ). This establishes the base of the induction. For the induction 
step suppose that the completion times of the activities begun at times . . . , 6j_i are all 

no smaller than S(w n ), Consider the activity begun at time 6j, and let be its site. As 

a consequence of the induction hypothesis, during the processing of [u; n , <5(tc n )) Si cannot 
receive any messages with time-stamps less than bj. Consequently, the decision to begin an 
activity at time b 3 was correctly fore-seen during the computation of £;(u; n ). The completion * 

time of the activity beginning at bj is thus no smaller than <5j(u> n ), and so is no smaller than 
6(w n ). This completes the induction. 

□ 

Under the assumption of non-zero duration times, it will always be true that w n < 6(w n ). 
Consequently, simulation time advances each window (even if no events occur in the window), 
and deadlock never occurs. - 


3.3 Example 

An example helps to illustrate the protocol’s mechanism. Consider a system with sites Si and 
S 2 . Site S\ permits an unbounded number of activities to occur simultaneously, while site 
S2 imposes queueing. The system moves objects between sites. Duration times are random. 
When an object completes its duration it either disappears, moves to another (possibly the 
same) site, or splits into a number of objects that move. S 2 uses Last-Come-First-Serve 
queueing. 

Let w n = 100, and imagine that objects 0\ and 0 2 are present at with scheduled 
completion times of 100 and 103. Object 0 3 is in service at S 2} and will complete at time 
101. Object 0 4 is enqueued at S 2 , and will eventually receive 4 units of service. 

The completion of 0\ at time 100 sends 0\ back to 5 X , where it will receive another 8 
units of service; the completion of 0 2 at time 103 sends 0 2 to S 2 where it will eventually 
receive 6 units of service; 0 2 ’s completion at time 103 also creates a new object O s which 
is sent to S 1 , where it receives 4 units of service. At site S 2 , 0 3 completes at time 101, 
and then remains at S 2 , where it will receive another 5 units of service. Observe that the 
messages reporting the completions of Oi, 0 2 , and 0 3 have already been sent, and the “next” 
durations of those objects have already been chosen. 

This scenario is summarized in figure 1, along with the contents of S 1 and 5 2 ’s event lists 
as observed at time 100, The event lists reflect the practice of pre-sending object arrival 
notices. S 1 determines its lookahead bound ^(100) by finding the minimum completion 
time among all objects it knows will arrive at or after time 100. 0 2 arrives (again) at time 
100 and completes at 108. 0 6 arrives at 103 and completes at 107, making <5i(100) = 107. 
S 2 determines <!> 2 (100) by identifying the next object to complete service that isn’t already in 
service. Because S 2 is LCFS, the arrival of 0 3 at time 101 causes 0 3 to receive service before 
0 4 . <5 2 (lOO) is 106, so that <5(100) = 106. Si and S 2 are thus free to simulate all events with 
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Objects at time 100 


Object 

Site Arrives Duration Completes Routed 

Comments 


O ! 

S 1 

? 

? 

100 

Si 

Occurring at time 100 

O ! 

Si 

100 

8 

108 

? 

Caused by completion of self 

O 2 

Si 

? 

? 

103 

S 2 

Occurring at time 100 

O 2 

s 2 

103 

6 

7 

? 

Caused by completion of self 

O 5 

Si 

103 

4 

107 

? 

Caused by 02 at Si 


O 3 

S2 

? 

7 

101 

S 2 

Occurring at time 100 

O 3 

S2 

101 

5 

106 

? 

Highest priority activity at 101 

O 4 

S 2 

? 

4 

7 

? 

Lower priority at 101 





< 5 ( 100 ) 

= 106 







Event Lists 





Si Event List at time 100 

S 2 Event List at time 100 




Event 


Time 

Event 

Time 




0\ completes 


100 

0 3 completes 101 




0\ arrives 


100 

03 arrives 

101 




0 2 completes 


103 

02 arrives 

103 




O 5 arrives 


103 






4 events processed 

in [ 100 , 106 ) 

3 events processed in [ 100 , 106 ) 



Figure 1 : Example of Synchronous Protocol Operation 


times no greater than 106 , in parallel. Si has four such events, S 2 has three (or four, if the 
processing of the O 3 arrival event at 101 creates a begin event at 101). 

Arrival events (enable events) at S\ may also serve as begin events since no queueing is 
imposed. Each site’s processing of arrival events includes the decision of where to route the 
object upon completion, and the generation of completion messages with the appropriate 
time-stamp. 

4 Analysis of Protocol 

Our performance analysis derives an approximated lower bound on the mean window width, 
then multiplies by the equilibrium event creation rate in order to bound the average number 
of events created per window'. By flow balance this bounds the average number of events 
processed per window. We then consider the behavior of this average as a function of 
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simulation activity rate, and minimum duration time. 

The analysis to follow uses results from the theory of stochastic order relations, and ma- 
nipulates hazard rate functions. Readers unfamiliar with these tools should consult Ross [25]; 
the appendix quickly sketches the main ideas and results we use. 

We are interested in the limiting value of the expected window width E\8(w n ) — re n ] as 
n ~ > °°> supposing that the limit exists. As we will see, a window’s width is comprised 
of the minimum of a number of complicated random variables. Complications arise both 
due to randomness in the model (e.g., random selection of sites where activities are caused 
following a completion), and due to dependence of the random variables’ distributions on the 
past activity in the simulation. Our approach is to bound the mean window width from below 
with the mean minimum of much simpler, and stochastically smaller, random variables. The 
stochastically smaller variables are constructed by considering hazard rate functions. This is 
a useful analytic trick which exploits the fact that the hazard rate function for the minimum 
of a group of independent random variables is just the sum of their individual hazard rate 
functions. 

One step in the bounding argument is intuitive, but not rigorously justified. Therefore 
one can only rigorously call our results approximate. 

The analysis uses a slightly more formal model than we have yet described. The duration 
time distribution for site S’, is taken to be ZA+exp{/<;}, where Di > 0 is constant and exp{/q-} 
is exponential with mean Pi = l/A,-. We let D^n be the minimum D{ value among all sites. 
The discussion of random variables, means, and hazard rates all concern the stochastic 
portion of the duration times. 

Our bounds depend on the manner in which a completing activity causes activities else- 
where. To more precisely describe these effects, for every site S,- let Reach(Si) be the set 
of all sites where activities caused by a completion at S,- can occur. For convenience we 
assume that the activities caused by a single completion are all at different sites. Activity 
Aj completing at Si randomly chooses a subset Bj C Reach ( 5',), and causes one activity 
at each site in Bj. We assume B : is chosen independently of the duration values of the 
caused activities. The distribution governing this choice is particular to S,-; p(B,i ) denotes 
the probability that B C Reach(Si) is the selected set. 

Let Aj be an activity occurring at site A,, and let B be the set of sites with activities 
caused by Aj. We will be interested in the rate at which the first activity completes, among 
all those caused by A r Towards this end, we focus on the stochastic portion of these activity 
durations. The “rate” of the minimum stochastic portion is just A # = Ys eB (see §A.2). 
The expected rate (with respect to the distribution of B ) is defined by 

* = £ P(B.i) [ E Ad 

BCReachiS,) \Sj6S / 

= Pr{compIetion at S* causes an activity at Sj}Xj. (1) 

Sj^Reach(St) 

Pathological analytic difficulties are avoided by assuming that the simulation model al- 


10 



ways has at least one activity occurring that causes other activities. This can be ensured, 
for example, by adding a “clock” site that does nothing but process a single, periodically 
self-causing activity. 

Let A(ui„) be the random set of activities occurring at time w n . During most of the 
analysis to follow we will condition on knowing that A(w n ) is some fixed subset V. The 
conditioning is undone later when we take an expectation with respect to the distribution 
of A(ui n ). 

The discussion to follow focuses on activities. To facilitate precise reference to the site 
where a given activity occurs, we often use the notation to describe the site at which 
activity Aj occurs. 

Consider the construction of the nth window. To compute 6(w„) we examine each site 
to determine the time of the next message it will send, in the absence of receiving any 
further messages. For a non-queueing site S',- this is the minimum completion time among all 
activities with begin events in S/s event list. For each such activity there is another which 
caused it, and which is occurring at time w n . If Si is a queueing site, the activity Aj whose 
completion defines S/ s lookahead bound is either enqueued waiting for the completion of an 
occurring activity at S',, or has its enable event sometime in the future. In the latter case we 
know there must be another site with an activity which is occurring at time w n , and which 
causes Aj. Therefore, every activity whose completion defines some lookahead bound can be 
associated with an activity occurring at w n . Conversely, for every Aj 6 V we can associate 
a set of sites Cj with activities caused by Aj, such that the completion time of each activity 
Ak £ Cj equals 8 s ( k )(w n ). Cj is obviously a subset of Bj, the set of all sites with activities 
caused by Aj, so that the minimum completion time among all activities caused by A } at 
sites in Bj is no larger than the minimum taken over Cj. 

We will want to distinguish queueing sites from non-queueing sites. We therefore define 
the indicator coefficient 7; to have value 1 if site Si is a queueing site, and to have value 0 if 
not. 

For every Aj £ V let Rj(w n ) denote the residual duration time of Aj — the difference 
between A/s completion time and w n . For every A } £ V at a queueing site S»(j) define 
Nj{w n ) to be 00 if 8 s(j) = 00, otherwise it is the duration of the enqueued activity whose 
completion time is < 5 s (j)(u; n ). Let E j(ui„) be the enabling time of the activity defining Nj(w n ). 
Observe that this activity is sensitive to w n : if Aj was occurring at time w n _i it is possible 
for a higher priority activity to be enabled between times u>„_i and w n , so that the activities 
defining Nj(w n ^i) and Nj(w n ) may be different. 

We define Nj(w n ) = 00 if S 3 ^ is a non-queueing site. Regardless of whether 5 s(j ) is a 
queueing or non-queueing site we may say that the completion rate of Nj(w n ) s stochastic 
portion is 7 s(j)^a(j)- 

Again, let Bj be the set of sites with activities caused by Aj ; for each S k € Bj let D k + Y jik 
be the duration of the activity at S k caused by Aj. We define A/s lookahead bound to be 
the minimum completion time among (i) the activities caused by Aj, (ii) the next activity 
to complete at S s{j) if S s(j) is non-queueing and receives no further enable events. A/s 
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lookahead bound as measured at time w n may be written as 


K j( w n) = w n + Rj(w n ) + min{max{0, Ej(w n ) - #,(«>„)} + Nj(w n ), min {D k + K,, fc }}}. 

Sk € Cj 

3{ w n) is the minimum lookahead bound among all activities Aj £ V. We may therefore write 
E[<5(u>„) - w n | A(w n ) = V] = E[ min (J^K)}] 

Ay € V 

- E \ I*\ [ l n v {^j( w n) + min{iV j (uj n ), min {£>* + Y,- *.}}}] 

A j t v Sk 6 By 

( 2 ) 


The expectation above is complicated by its dependence on the history of the synchronization 
behavior up to time w n . For example, suppose that activity Aj began in the ( n — 6j)th 
window, for some bj > 0. The distribution of Kj{w n ) must be conditioned on the event 
G](w n ) that Kj(w n _ c ) > u) n _ c+1 for all 1 < c < bj. Since Kj(wn) is largely comprised of 
random variables that also comprise Kj(w n _ c ) for each c, conditioning on Qj(w n ) makes 
each Rj(w n ) probabilistically larger than it would be if each component random variable 
had its original, unconditional distribution. The starting point for our bound is to build a 
stochastically smaller replacement for each Kj(w n ) by replacing each of I\ j(ic„)’s components 
with a pristine unconditional random variable with the appropriate distribution. 

We construct an “unconditioned” lookahead variable for each Aj as follows. Randomly 
choose a subset U s (j) C Reach(S s ^)) in accordance with the probability distribution {p( B, *m, 
and independently choose a duration time D k -f X j<k for each S k £ U s (j). D k + X Jtk will re- 
place the actual corresponding duration time D k -\-Yj ik . Randomly and independently choose 
some value D s ^ 4- from 5,(j)’s duration time distribution. If S s ( 3 ) is a non-queueing 

site we take Wj^j) — oo. D s ^ will replace the actual Nj(w n ). Let Zj <s ^y be an 

independent exponential having the distribution of the stochastic portion of S s ^'s duration 
time. Zj, a (j) will replace Rj{w n ); note that the residual of an ^ 3 (j) duration time is always as 
large as the residual of the duration time’s stochastic portion. 

The event Gj(w n ) gives us information that Kj(w n ) is probabilistically larger than it 
would be if its components had their original distributions. Therefore, intuition suggests 
that the following inequality is true 


lim £[<5(u; n )— > lim E\ min 
n ~ f0 ° Aj e A(w n ) 


{ + min { D s(j) + Wj Aj) , min 

b k 6 U s (j) 


{£>*+*,,*}})]. 

( 3 ) 


Note that the expectations involved in this assumption are not conditioned on A(w n ) = V, 
and that we only require the inequality to hold in the limit of n — ► oo. It seems exceedingly 
difficult to formally establish this bound. Our analysis therefore proceeds by assuming its 


validity. 


Assumption 4.1 Inequality (3) is true. 
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We continue the analysis by placing stochastic lower bounds on variables comprising the 
conditional (on A(w n ) = V) expectation 


E[ 


min 

A } ev 


{Z iMj) + min {D 3{j) + W jAj) , 


min 

s k e 


{Dk + X**}}}]. 


(4) 


As a first step we note that 

min {D k + Xj k ) > min {Xj tk } + ^min- (5) 

Next we put a stochastic lower bound on min-fAy^lSfc € This random variable is 

complicated by the fact that U s (j) is a random set. For any given set U s (j) — B{, the 
minimum of |B,| exponentials is itself an exponential, with rate A B , = I Zs k eB, ^k- Conse- 
quently min{X, ifc |S fc G Ms(j)} is a probabilistic mixture of exponentials — with probability 
p(Bi,s{j)) it is an exponential with rate A#,. Without loss of generality we may enumerate 
all subsets Bi C Reach{S s y)) in such a way that A#, < \b } whenever i < j. Given this or- 
dering, Lemma A.l establishes that an exponential Tj )S (j) whose rate is the “expected” rate 
^s(j) = Eb, p{Bi,s{j))X B (see expression (1)) is stochastically smaller than the minimum: 

min {X,,*} > 3f T jA j). 

s* e Mj(j) 

Applying inequalities (5) and (11) we determine that 


min{ £),(,•) + min {D k + X j>k }} 

S k £ 

> min{Z) 5 (j) + {Xj,k} + ^min} 

Sk € u 3 (j) 

>st rnin{^min "H “l" ^min} 

= min , 7j j5 (j) } + -Dmin* (^) 

Since W jyS ^ and Tj^j) are both exponential, their minimum is also exponential and has rate 
7 s(j) A s{j) + ^ 3{j) (recall that 7, (i) = 0 and W jAj) = oc if S s{j) is a non-queueing site). Let 
Uj^j) be an exponential with rate Inequality (6) holds for every Aj E V; 

furthermore, the lookahead random variable constructed for each Aj is independent of all 
others. Since the addition and min operators are increasing it follows from (11) that the 
expectation in (4) is bounded from below: 

E\ min {Zj, 3 (j) + min { Z) s (j) + Wj- min {D k + Xj, k }}}} 

Aj 6 V S k € M s (j) 

> E[ min {Z jMj) + U jAj) }] + ^min* (7) 

Aj 6 V 

We remove the conditioning on V by taking the expectation with respect to A(w n ), 


E[ min + min{D s (j) + Wy s (j), min {D k + X Jtk }}}] 

> E[ min {Zj, s (j) + + ^min- 

Aj € .A(u/ n ) 
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If A has the limiting distribution of A(w n ) asw->oo (supposing it exists), then 


> 


min {D a(j) + W jMj) , min {D k + Af it *}}}] 

z>k € U s (j) 

+ ^7,s(i)}] 4- L^min* 


( 8 ) 


Our next task is to deal with the randomness of the set A. 

Let the collection of site servers in the domain be enumerated as and define 

v(j ) to be the index of Vj's site. For each i = 1 , 2 , . . . , S and j = 1 , 2 ,... let 

~ JilS) ^'[ num ^ )er °f site 5, activities occurring at time w n ], 


Pi — Jirr^ Pr{at time w n an activity is occurring at V,}, 

and observe that 

“>i = X Pv 
V Jt vU)=< 

assuming that the expectations and limits exist. It is not obvious that should be identical 
to the equilibrium expected number of activities occurring at 5,*; intuitively one expects it 
to be close, because the number of windows in which a given activity is found occurring is 
roughly proportional to the duration of the activity. 

The expectation on the right-hand-side of (8) is taken with respect to a distribution of 
random sets of activities found occurring at a window edge. One can equivalently view it as 
an expectation taken with respect to a random set of servers found busy at a window edge. 
Inequality (7) suggests we associate two exponentials with each server Vf Zj and U 3 (here 
binding j to the server rather than to the activity). There is a one-to-one correspondence 
between a random subset of servers, and a random subset H C {(Zi, CA), (Z2, C/2), 

Lemma A. 2 was developed to deal with the situation at hand. Following its statement 
we define 


A — X Uj) £ H}Ku)( iv(j)K(j) + i’v(j)) 

j = 1 

00 

3 = i 
5 

= X w «M7«A, + fa). (9) 

i=i 

The lemma’s conclusion is that 


E[ min 

iAitUj) € H 


{Zj + Uj }] > 



The left-hand-side of this inequality is identical to the right-hand-side of (8), except for the 
inclusion of D^n- Assuming the validity of assumption 4.1 we may conclude that 


Jfim E[8(w n ) - w n ] 



( 10 ) 
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In order to determine the average number of events processed per window we need to 
consider the rate at which events are generated by the simulation. Let be the equilibrium 
mean number of activities occurring at S{. There are two events associated with each activity 
at a non-queueing site, begin and complete . Adding enable , there are three events 
associated with an activity at a queueing site. We therefore define the variable e; to be 2 or 
3 depending on whether S, is non-queueing or queueing, respectively. 

An activity’s duration at S, has mean = 1 f(Di + Mi), so that the equilibrium event 
creation rate is A sys = . By flow balance this is also the equilibrium event 

completion rate. We can therefore multiply this rate times the lower bound on the mean 
window width to bound the mean number of events processed in a window. 

Theorem 4.2 Let 

Asys 

be the system event creation rate, and let 

s 

A = + i>i)- 

1=1 

Then if assumption 4-1 is valid, the average number of events processed per window is at 



□ 


= X e ^« A « 


v \( d ) 


{=1 


This theorem demonstrates how an existing minimal service time accelerates performance. 
Given constant Drain ^ 0, the bound increases at least linearly as the total simulation event 
rate increases. However, good performance is also possible when D„^ n 0, as we will see. 

The value of A is defined in terms of u^. We have no immediate cause for believing that 
uii = top, nor is it clear that the two quantities should be widely different. It seems reasonable 
then to take u>, c5, as a first approximation. Doing so permits us to analytically estimate 

A in some simple cases, and quantify the bound given by Theorem 4.2. 

As pointed out by Wagner and Lazowska [30], interconnection topology plays an impor- 
tant role in determining the performance one achieves with a queueing system. Network 
bottlenecks limit the volume of simulation activity. This is reflected in Theorem 4.2. For 
example, in a network where each site has one server, is approximately the server uti- 
lization. A bottleneck site will have a very high utilization while those at other sites are 
comparatively low. After a point, adding jobs to the network does not appreciably increase 
the sum of server utilizations, hence the overall event rate does not appreciably increase. For 
the same reason simulated queueing systems are constrained even if the throughput at each 
site is equal. The overall system event rate is maximized when all site utilizations are one. 
After a point, to increase simulation activity one needs to increase the size of the network. 
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We can approximate the bound in Theorem 4.2 in some simple cases. Consider a model 
where objects move throughout the domain. An object resides at a site for a fixed time 
plus an exponential time with mean 1/A, and then moves to any other site, chosen uniformly 
at random. Equilibrium flow balance equations are easily solved in this situation. Working 
through the details with K objects and D n ^ n = 0, one discovers that at least \JirKj2 events 
are processed per window, on average. A relevant point is that the inter-site communication 
topology is that of a fully connected graph. Such topologies are generally taken to be 
extremely taxing on conservative synchronization methods, because the “next” event at a 
site can come from anywhere. Nevertheless, a significant amount of work is performed each 
window, at least when K is large. Figure 2 plots the analytically bounded and empirically 
measured average number of events processed per window, as a function of log 2 I < . The 
empirical measurements represent the sample mean of ten long simulation runs. There was 
very little variance between these runs. Figure 2 shows that if thousands of objects are in the 
model, hundreds of events are processed each window. Since parallel processing techniques 
are used primarily when serial processing times are too slow (or memories are too small), 
we see that this result applies directly to situations of practical interest— large simulation 
models on medium scale parallel architectures. 

Performance is greatly enhanced when D„u n > 0. Figure 3 plots measurements of the 
number of events per window for small models, having only 256 and 1024 objects. The same 
measurement methodology as was described for Figure 2 is used here. The analytic bound is 
not displayed, being indistinguishable from the measured performance when plotted on the 
graph. Dmin is varied between 0 and /z = 1/A, so that Dmin/f* varies between 0 and 1. We 
see that if a model has minimal duration times we can expect many more events per window 
than if not. Note that the protocol does not need to know D^, as it is already part of the 
pre-sampled duration times. Dramatic performance improvement as one’s ability to “look 
ahead increases has also Teen observed by Fujimoto [6], 

Our confidence in the conclusions of Theorem 4.2 is increased by the fact that the approx- 
imated lower bound did uniformly fall below measured performance. Similar results have 
been observed when comparing the measured and bounded performance on less homogeneous 
simulation models. 

5 The Cost of Conservative Synchronization 

Next we consider the overheads involved in implementing this conservative protocol. First 
we identify conditions under which the average number of events processed per window 
will grow without bound as the system event creation rate grows without bound. Then 
we show that as the number of events processed per window grows, our method’s per- 
event overhead due to synchronization, processor idle time, lookahead calculation, and event 
list manipulation becomes within a constant factor of average the per-event overhead of 
performing the simulation serially. 
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Figure 2: Average events per window, as function of number of objects, D m i n — 0; durations 
are homogeneous exponentials; no queueing; routing is uniformly random. 


One way of increasing the system event creation rate A s ys is to increase the “size” of the 
model. For example, we increase the size of the moving objects simulation described earlier 
by increasing the number of objects in the domain. We may also increase the number of 
sites, although in this case it is not necessary. Theorem 4.2 shows how the average number 
of events processed each window may increase as A s y s increases. Clearly, if D lx ^ n > 0 then 
at least tisyDmn events are processed each window on average. It is also possible for the 
average number of events to increase without bound as A $ys increases even when -Dmin == 0. 
For example, suppose there is a value a such that as the size of the simulation model is 
increased the following bound is true for all sites S i\ 


This condition is a formal statement that as the model size grows ipi can’t get too large 
relative to A;, and that any difference between and Ui doesn’t get too large. The first 
condition will be satisfied if there exists A max and i? max such that as the model size grows, 
A,- < A max and \Reach(Si)\ < 7? max , for all i. The second condition ought to be satisfied 


17 



Avg Events/Window 

1500 q ^ 1 ^ 1 ^ 

I * a measured avg, 256 objects 

1350 

I • measured avg. 1024 objects 

1 200 -| 1- "T — | 

1050 — u — * — - 

900 1 . " 1 

750 -I - ,, 

O 

600 1 • 

a 

450 “ — — ;; “ 

300 o- 5 — : — i a )o — 

1 50 - - tl u 

3! □ '■ D 

0 ~ I — I T — n — 1 I i — I — 1 t I — t — — t — T 1 ! ! 1 — T 1 ~T 1 1 1 r — r— ! i i — 1 1 1 T — I T T T T — I 1 — 

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 

Constant/Stochastic mean 








* a measured avg, 256 objects 
• measured avg, 1024 objects 
















• 1 







i 

m 

• ' 







• 

• 







4 

H 








i 

• 

* 








• 









t 

m 



j □ C 

1 ° * 

1 □ G 

\ O 1 

J □ 1 

1 G c 

i □ ‘ 

\ ° ‘ 

1 ° i 

j □ c 

1 






| 


i — i i ■ i — 1 — s — i — i — i — — i — ; — i — r — — i — ; — i — i — I — i — i — i — i — Hi — i — i — i — — r — r— s — i — — i — i — i — i — — i — i — i — i — — f — i — i — r 


Figure 3: Average events per window when > 0. Performance plotted as function of 

Dmin/M? f° r 256 and 1024 objects. 


if our intuition that u>i & is correct. If the bound above holds, then as the model size 
grows the inequality A < a A s y s will always hold. It follows that at least yjn XsysJ{2a) events 
are processed each window on average, a number that grows without bound as A s ys grows 
without bound. 

As a point of comparison, we assume that a serial implementation uses the best known 
event list management algorithm. If there are T total events in the system on average, we let 
f{T ) be the average complexity of an optimized serial event list algorithm. For example, there 
is some evidence that a “calendar-queue” implementation has an average 0(1) complexity 
(i.e., f(T) = 1) on the hold model [3]. A number of other event list algorithms exhibiting 
ft(logT) average complexity are also commonly used [8], We assume that the serial event 
list algorithm permits the deletion of a non-minimal element without affecting the overall 
average complexity. This assumption is satisfied by the calendar queue implementation. 

We make the reasonable assumption that as the simulation model size is increased, T 
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grows at least as rapidly as S, i,e., T — 1 

Now consider a parallel simulation that uses our protocol. The requirement that comple- 
tion notices be pre-sent may increase the average total number of events in event lists at a 
time. This represents a factor of two increase, at most. In the complexity analysis to follow 
we need not explicitly concern ourselves with this constant factor increase. 

One overhead suffered in the parallel simulation is the cost of computing 6,(u>„) tor each 
site Si. This value changes only when an event is inserted or deleted at 5,-. A queueing site 
can recompute this value with 0(1) cost whenever its event list changes. A non-queueing site 
can organize the completion times of its pending activities in a priority queue using w a ever 
event list algorithm is employed by the optimized serial implementation. The minimum value 
in the priority queue defines S { . The priority queue is modified only when the site s event 
list is modified, at cost 0(/(T)). A processor can organize the 6, values from each of its si es 
into priority queue, enabling it to determine the minimum on-processor <5, value at least as 
quickly as the optimized serial implementation finds its minimal element. Maintenance of 
this priority queue costs 0(/(S)) on average for each processed event. Once each processor 
has determined its locally minimum 8, value, all processors may cooperatively compute the 
global minimum in C Syn ch time. Note that our assumption that P is fixed permits us o 

ascribe a worst-case constant cost to this operation. ,111 1 

Another overhead is processor idle time. The protocol is punctuated with global synchro- 
nizations, between which the processors execute in parallel. A processor with little workload 
will spend a long time waiting for more heavily loaded processors to reach the synchro- 
nization barrier. Suppose there are W events to process in a window. For the purposes of 
analysis, assume that each event may be mapped to any processor, with equal probability . 
Then the number of events assigned to a processor is a binomial B{W, l/P) random variable. 
The collection of workload random variables are not independent however, as we know they 
must sum to W. However, it isn’t difficult to construct a coupling[25] argument to show that 
the expected maximum workload of this system must be smaller than the expected maxi- 
mum workload of a system where each processor has an independent ( » / ) w ° r ° a 

The binomial distribution has an increasing hazard rate function [25](p.280); ; it is there ore 
stochastically less variable than an exponential with the same mean 25](p.2 3), an ence 
the expected maximum of P independent exponential random variables with mean WJP is 
at least as large as the expected maximum of P independent J anables ' 

The expected maximum of the exponentials is approximately ( W/P)\n(W/P). Assuming 
each event takes the same amount of time to process, the average fraction of time a processor 

is left idle is no greater than 

(W/P) J_ 

(W/P)\nP In P 

This implies that the average overhead cost per event due to processor idleness is 0(1). This 

iThis can’t rigorously be true, since events at the same site are evaluated on the same processor. It is a 
reasonable approximation when W is large compared with P. 
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analysis is actually quite pessimistic. Much better load balance can be achieved through 
use of mapping techniques such as scatter decomposition^]. Also, the bound above is 
insensitive to increasing volume of workload, whereas in practice the proportion of idle time 
tends to decrease as the volume of workload increases. 

The complexity of the average per-event overhead due to event list manipulation, looka- 
head calculations, processor idle time, and global synchronization is 


+ p(m) + o( i» + c Svnch 
w 


= 0(f(T)). 


Relative to the serial simulation, performance must then be within a constant factor of 
optimal, at least if inter -LP communication costs are ignored (we have already accounted 
for the communication costs that are specific to the protocol). Inter-TP communication 
costs are dependent on the simulation model and its mapping, and are dependent on the 
arc itecture. It is possible for communication costs to overwhelm performance, even if our 
protocol finds a great deal of parallel workload. However these costs are inherent to the 
model, and would be suffered under any synchronization protocol. 


6 Empirical Results 

We used the protocol analyzed in this paper in a parallel discrete-event simulation testbed im- 
plemented on the Intel iPSC/2 distributed memory multiprocessor^]. The testbed, YAWNS 
(Yet Another Windowing Network Simulator) [21], is designed to permit rapid development 
of simulation models, by providing a framework within which all synchronization and inter- 
processor communication activity is automated, and hidden from the user. YAWNS uses 
a computational paradigm where the simulation model is decomposed into communicating 

Logical Processes ( LPs ). LP s interact by passing messages. A site in our analytic model 
plays the role of an LP. 

The simulation modeler must provide the testbed with three routines for each LP (the 
L P ' s ma y sh * Te these routines). One routine processes messages, typically inserting an event 
into the TP’s event list as a result. This routine is responsible for choosing a duration time 
for the enabled activity. Another routine processes events. Messages to other TPs may be 
generated as a result of calling this routine; these messages correspond to the completion 
messages described in the analytic model. The third routine is called to obtain the lookahead 
value required of an TP. YAWNS demands that the simulation modeler know about the 
protocol only to the extent that inter-TP messages are pre-sent, and an TP must be able to 
determine a lower bound on the time of the next message it sends. 

It is always important to use the best possible event list algorithm for an LP. YAWNS 
provides a linearly-linked list algorithm for use when the number of events in an TP’s list is 
small, and a splay-tree algorithm for large lists. 

We report on the performance achieved by four diverse applications: the moving objects 
simulation described earlier, a logic network, the game of Life, and a timed Petri net model. 
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All measurements reported are taken from a thirty-two processor machine. Each simulation 
model was run long enough to generate several millions of events. The execution time was 
typically a minute or two, once the problem was loaded and running. Much longer runs were 
also performed, but no appreciable difference in performance statistics were observed. 

The measured performance supports our analysis, and actually becomes quite good on 
large problems. The metric we use to gauge performance is average processor utilization, 
measured as the fraction of time a processor spends doing work that would be performed in 
a serial implementation of the simulation, using the same paradigm. Time spent in comput- 
ing lookahead, synchronization, interprocessor communication, and idle time are explicitly 
counted as overhead, and do not appear in the utilization figure. One can translate such 
efficiencies into “speedup” figures by multiplying by the number of processors used, provided 
the resulting numbers are properly interpreted. The speedup so computed is relative to a 
serial version that uses the same paradigm (and code) of communicating LP s as is used in 
the parallel version. This is not an unreasonable paradigm for a general purpose serial simu- 
lation system, but is not likely to be the paradigm of choice for a serial version that is highly 
optimized for the given application. In our experience (and depending on the application), 
the communicating LP paradigm is a factor of 1.5 to 2 times slower than an optimized serial 
version. The usual comparison of serial running time to parallel running time is impossible 
to directly obtain, as the largest models we simulate are too large for a single processor’s 
memory. We will see that on the largest problems the average processor utilization ranges 
from 60% — 90%. 

6.1 Moving Objects 

The sites are connected in a hypercube topology. In each model there are exactly as many 
objects as there are sites. Each object resides at a site for a time constructed by adding 
0.25 to an exponential with mean 1 . We increase the size of the problem by simultaneously 
increasing the number of objects and the number of sites. We may therefore describe the 
size of the system by the dimension of the underlying hypercube. Pre-sent completion times 
and lookahead values are computed exactly as described for non-queueing sites in this paper. 
Average processor utilization p as a function of hypercube dimension is given below. Many 
simulation runs were performed, the variance in the timing numbers is quite small. 


Dim 

8 

9 

10 

11 

12 

13 

14 

P 

21% 

28% 

34% 

46% 

54% 

60% 

62% 


6.2 Logic Network 

To ensure that we simulated networks with high concurrency we constructed “random” logic 
networks having the topology of a butterfly interconnection network. The last stage wraps 
around to feed the first. Each gate was randomly assigned to be an AND, OR, or XOR 
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function and was given a randomly chosen gate delay time of 1,2, or 3 time units. Each gate 
was modeled as an LP. The eventual output of a gate whose inputs have changed can be 
computed at the time the inputs change, hence gate state changes can be pre-sent, A gate is 
like a non-queueing site; its lookahead is computed to be the gate delay plus the minimum 
time of the next input change. The size of network can be described by the dimension of a 
column of gates. For example, a network of dimension 6 has 6 columns, each composed of 
2 6 gates. Observed performance is given below. 


Dim 

5 

6 

7 

8 

9 

10 

11 

P 

24% 

32% 

43% 

52% 

59% 

66% 

70% 


6.3 Conway’s Game of Life 

Initial random configurations were chosen so that the probability of a cell being alive is at 
step 0 is 0.2. Each cell is modeled as an LP. A cell is evaluated at step n only if one of 
its neighbors (or itself) is alive at step n — 1. It is straightforward to pre-send “new state” 
messages; lookahead consists of one step time. The problem size is increased by increasing 
the size of the board. Again, we can easily describe problem size in terms of dimension. A 
2 J x 2 J board will be said to have dimension j. 


Dim 

3 

4 

5 

6 

7 

8 

P 

12% 

16% 

35% 

54% 

69% 

77% 


Larger problems than a 256 x 256 board will often exhaust the available dynamic memory 
in some processor, after some period of execution. This points out one of consequences of 
internally buffering all messages until the window’s workload is completed. 


6.4 Timed Petri Nets 

Consider a timed Petri net model of a multiprocessor system organized with a mesh commu- 
nication topology. The net models a system where a processor iteratively receives a message 
from each of its NEWS (North, East, West, South) neighbors, performs a computation, and 
sends a result to each NEWS neighbor. The net models a flow control policy that prevents a 
processor from sending a message to a neighbor until the last message it sent to that neigh- 
bor is consumed. An LP consists of the network for one processor, a network containing 
approximately thirty places and ten transitions. Nearly all transitions have a unit time delay 
associated with them. Transitions modeling the processor execution time have 200 units of 
delay. 

This Petri net model does not satisfy exactly the assumptions we’ve made concerning 
simulation model behavior. The main difference is that a token arriving to an LP does not 
trigger a single LP activity with a single duration time. The response of the LP is liable 
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to be much more complex. Nevertheless, the basic synchronization protocol works. Tokens 
from an enabled transition are always pre-sent (regardless of whether they are sent to places 
within the LP ); to compute lookahead, an LP adds the minimum delay among all transitions 
that send tokens to other LP s to the least-time token arrival event in the LP ' s event list. 

The grid size for the simulated system can be described in terms of dimension in the 
same way as was the Game of Life. 


Dim 

3 4 5 6 

P 

35% 62% 84% 94% 


The comparatively better performance of this problem can be attributed to its better ratio 
of computation costs to LP-message costs. 


7 Conclusions 

We have analyzed a simple conservative synchronization protocol for parallel discrete-event 
simulation. The protocol presumes that one can pre-sample activity duration times (or bound 
those times from below), that the immediate effects of simulation model state changes are 
very local, and that all queueing disciplines are non-preemptive. The protocol essentially 
slides a window across simulation time; the window is defined so that processors can evalu- 
ate all their window events in parallel. We construct an approximated lower bound on the 
average number of events processed per window. The bound depends on the topology and 
activity rates of the heterogeneous simulation domain. The performance analysis shows that 
a great deal of workload can be performed in parallel, if there is a great deal of concurrent 
activity in the simulation model. Non-zero minimal activity durations are shown to greatly 
improve performance. We show that the asymptotic time complexity of the average total 
overhead (synchronization, lookahead calculations, processor idle time, event list manipula- 
tion) per event is that of of an optimized serial simulation. Assuming that the complexity 
of the communication cost per event is no greater than the overhead of an event in a serial 
implementation, the protocol’s performance is within a constant factor of optimal. The re- 
gion of problems where the method does well is precisely the region where parallel processing 
is most effectively applied — problems too large to run serially. The method is verified by 
implementation on a distributed memory multiprocessor. Good performance is observed on 
a variety of problems. 

A Appendix 

In this appendix we describe the tools used in our analysis, and develop some key results. 
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A.l Stochastic Dominance 


Our analysis relies on the theory of stochastic dominance. The definitions and results we 
cite are taken from Ross [25], chapter 8. 

Random variable X is said to be stochastically larger than random variable Y if for all t 

Pr{X > t} > Pr{T > t}. 

We then write X > st Y, or Y < st X. An equivalent definition is that 

E\g(X )] > E\g(Y) ] for all increasing functions g. 

In particular, E\X } > E\Y], If X% t ... , X n are independent random variables and . . . ,Y n 
are independent random variables such that Xi > st Yi for all i , then for all increasing functions 
9 , 

g(X u ...,X n ) > st g(Y u ...,Y n ). (11) 


A. 2 Hazard Rate Functions 

If A is a nonnegative continuous random variable, it has a hazard rate function , also known 
as a. failure rate function. Let /(f) be X's density function, and let F{t) = Pr{A' > t). Then 
Af’s hazard rate function is defined to be 

= M 

F(ty 

If x is exponential, then A (f) is identically the exponential’s rate parameter. 

We rely on the following results concerning hazard rate functions. 

• If A x(f) and A y(t) are hazard rate functions for X and Y , and A x(f) < A y(t) for all t, 
then X > st Y. 

• If Xi , . . . , X n are independent random variables with hazard rate functions Ai(f), . . . , A n (f), 
then the hazard rate function for min{Ai, . . . , X n } is simply £,=i A,(t). 

• If X has hazard rate function A(f), then for any t and s, s < t, 

Pr{Af > t\X > s} = exp{— f A (u) du }. 

J S 

This also shows (taking s — 0) that the hazard rate function uniquely defines a distri- 
bution. 
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A. 3 Important Bounds 

We now establish some important bounds used in this paper. 

Random variables constructed by randomly choosing one of a set of random variables are 
called mixtures . The following lemma bounds the hazard, rate of a certain class of mixtures. 

Lemma A.l Let Xi, X 2 , . . . , be independent random variables with hazard rate functions 
Ai(tf) , A 2 (i),...„ and suppose that these functions are ordered: 

A i(t) < A i+1 (i) for all 2 = 1,2,.. and all t > 0. 

Let ^ e probabilities such that Pi = T an d consider the random variable M 

constructed by randomly selecting some Xi, with probability p*. Let A m( 1) be M s hazard rate 
function . Then for all t > 0 

00 

MO 

i=l 


Proof: 2 Let fi(t) and F,(<) be the density and cumulative distribution functions for X { . 
Then Xi(t) = fi(t)/F,(t), and 

££1 pJM 


Aa /(0 


££1 


The desired conclusion will follow if 


00 00 00 

£?./<(() < 

i=l i=l i=l 

for all t. Let Y = i with probability p, and let h(Y,t ) = A y(t) and g(Y, t) = — Fy{i). Then 
for every fixed t , h and g are increasing in Y . Application of Proposition 7.1.5 of [2o](p. 
227) yields 

E[h(Y, t)}E[g(Y ; t)} < E[h(Y , t)g(Y \ <)], 

or equivalently, 

1 oo CO oo 

£ < (£p,i ? , (*))(£ PA,(0)- 

t-1 t=l 1 

As this holds for every t > 0, the lemma’s conclusion follows. 

□ 

We now develop a lower bound on the expected minimum of a random number of vari- 
ables, each variable being the sum of two exponentials. 

2 This elegant proof was suggested by an anonymous referee, replacing a far more complicated proof of 
our own. 
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Lemma A. 2 Let S = {(Z 1? L^), (Z 2 , U 2 ), . . . , } be a countable set where Z{ is exponential 
with rate and Ui is exponential with rate ?/>,*, a// £ftese random variables be independent. 

Let B\, ^2 , . . . be the set of finite subsets of S. Let B be a random set constructed by choosing 
Bi with probability pi . Let 

oo 

A = YlM(z i ,u i )eB}\ i A. 

i-l 


Then 


E\ min 
(Z i 9 Ui) € B 


{Zi + u t }} > 



Proof: Consider the hazard rate function 7,*(f) for Z, -f Up This random variable is the 

lifetime of a serial two-stage system where the first stage lasts for time Z,, and the second 
lasts for time Up 7,(i) is the instantaneous probability density associated with the system 
dying at time f, given that it has survived up to time t. Now if Z, > f, the system cannot 
fail at t , whence 7 fit) = 0. If Z t < t , then the hazard rate is simply that of Up. if{. Note that 
this observation relies on the memoryless property of the exponential. We may therefore 
write 


7,(<) = (1 — Pr {Zi >t\Zi + Ui > 
< (1 - Pr {Zi > 

= (1 -exp 


One can show that the left-hand-side of this inequality is equivalent to the more usual (and 
complicated) derivation of the hazard rate function for the sum of two exponentials [29](p. 
126). The function on the right-hand-side is concave in t, and is hence dominated everywhere 
by the line tangent to it at t = 0: r,(<) = A random variable V { with hazard rate 

function t;(/) satisfies < st Z, + [/,. 

Let Bj be any finite subset of S. By (11) and the observations above we may conclude 
that 


E\ min 
(z„u t ) e Bj 


{Zi + Ui}] > E[ min {V-}]. 

(Z| y Ui ) t Bj 


We now focus on the right-hand-side of this inequality. The hazard rate function for Mj = 
min {ViKZ^Ui) € Bj] is simply 


Ab, W = <( E A*-). 


Without loss of generality we may enumerate the finite subsets of S in such a way that if 
i < j, then A B,(t) < A B } {t) for all t. Let M be a mixture of {Mi, M2 ,...}, where Mj is 
chosen with probability p,; let A m(£) be M’s hazard rate function. By Lemma A.l we can 
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bound Am(<) from above by A y(t), defined by 

OO 

A m(0 < 

t=i 

OO 

= E Pr {(Z„ Ui) € B}tA<ifc = X Y {t). 


1=1 


Let y be a random variable with hazard rate function Ay(i). Using the correspondence 
between hazard rate functions and probability distributions (see §A.2), we have 

Pr{ min {Vi} > t] > Pr{F > t } 

(Z it Vi)eB 

= exp{— / Ay(s) ds} 

Jo 

OO 

= exp {—'^2Er{{Zi,U i ) £ B}t 2 \ i ipi/2} 

i=i 


= exp{— Af 2 /2}. 


Now 


r OO 

E[ min {Z,- + Ui}] = / Pr{min{Z, + Ui\(Z u Ui) E B} > t} dt 

(Zi,u,)eB Jo 

roc 

> / exp{— At 2 /2} dt 

Jo 

= exp{— s 2 /2} ds by defining s = ty/X 



□ 
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